Database and Backup

Database Files

When using Datasets, you may have noticed that a file of the form dataset-abcd1234.yaml is created in the working directory where the dataset is run. This is known as a Database file, and it is where the Dataset stores its information.

This functionality allows the skip option to work across notebook restarts.

However, the Database is more than just a store of important data; in fact, it holds almost all of the Dataset’s state.

Dataset Permanence

Because the Database stores this complete state, we can recreate a Dataset from its Database file at any time. This is in fact what happens when you restart your notebook with skip=True in the Dataset initialisation (the default).

We can also use this functionality in reverse, “packing” a dataset into a file that can be transferred or backed up.

Let’s create a dataset to play with:

[1]:
from remotemanager import Dataset

def function(a, b):
    return a * b

ds = Dataset(function, skip=False)

Note

It is important to note here the behaviour of the skip parameter. When a dataset is created, it will search for a Database file that matches the parameters given. If found, it will unpack itself from that file by default. If skip=False, it will instead delete that file and recreate itself in place.
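
As a minimal sketch of the two behaviours (the variable names here are purely illustrative):

# skip=True (the default): if a matching Database file exists,
# the Dataset unpacks itself from it, keeping any previous runs
ds_resumed = Dataset(function)

# skip=False: any matching Database file is deleted and the
# Dataset is recreated from scratch
ds_fresh = Dataset(function, skip=False)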

[2]:
runs = [
    [1, 10],
    [7, 5],
    [12, 3]
]

for run in runs:
    ds.append_run({"a": run[0], "b": run[1]})
# run the dataset
ds.run()
# wait for the completion, checking every 1 second, up to a maximum of 10 seconds
ds.wait(1, 10)
# collect the results
ds.fetch_results()
# check the results
ds.results
appended run runner-0
appended run runner-1
appended run runner-2
Staging Dataset... Staged 3/3 Runners
Transferring for 3/3 Runners
Transferring 9 Files... Done
Remotely executing 3/3 Runners
Fetching results
Transferring 6 Files... Done
[2]:
[10, 35, 36]

Now that we have a completed run, let’s explore some situations where the Database helps us.

Notebook Restarts

The most common use of these files happens automatically for you if a notebook is killed and restarted. If a Dataset is created without skip=False (that is, with the default skip=True), it will recreate itself from the Database file if it can. Let’s simulate a restart here by deleting the dataset:

[3]:
del ds

Now the dataset no longer exists within the notebook, exactly as if we had killed the notebook and restarted. Let’s recreate it as if we were rerunning:

[4]:
ds = Dataset(function)

ds.results
[4]:
[10, 35, 36]

Since the dataset was recreated, it still contains everything necessary to continue as if it had never been deleted. If we try to run again, it will skip, since the runs have already succeeded:

[5]:
ds.run()
Staging Dataset... No Runners staged
No Transfer required
[5]:
False

In short, this means that your often long and intensive calculations are independent of the notebook. You do not risk resubmitting a large job if you accidentally close your notebook and rerun.

Notebook Transfers

Since the notebook and the database are just files, this also allows you to transfer your datasets to another machine or person. Simply copy across the notebook, along with the database, and remotemanager will attempt to run as if nothing has changed.

Important

Note that while the Dataset will attempt to run as normal, outside factors such as the Python environment can still affect the run.

Important

If your Dataset requires (or creates) extra files that are needed for your workflow, be sure to place these at the same location relative to the new working directory. Later in this tutorial we will cover Dataset.backup, which automates more of this process for you.
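
As a rough sketch, a manual transfer amounts to copying the notebook alongside its Database file (the notebook name and destination path below are hypothetical):

import shutil

# copy the notebook and the Database file to a shared location;
# everything except ds.dbfile is a placeholder
shutil.copy("my_notebook.ipynb", "/path/to/shared/folder/")
shutil.copy(ds.dbfile, "/path/to/shared/folder/")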

Renaming the File

The automatically generated filename for the Dataset can be hard to remember, and if you have multiple datasets running, hard to distinguish. You can influence this filename in several ways:

  • Give the Dataset a name

  • Set the dbfile parameter

  • Pack to a custom file

Let’s go through these now.

Naming the Dataset

Datasets can be given a name parameter, which makes their files easier to identify. First, let’s take a look at the filename for the dataset created earlier.

[6]:
ds.dbfile
[6]:
'dataset-d8ecb370.yaml'

Not exactly memorable. Let’s recreate with skip=False and give the new Dataset a name:

[7]:
ds = Dataset(function, name="functiontest", skip=False)

ds.dbfile
[7]:
'dataset-functiontest-291a69ef.yaml'

Now our name has been added to the filename, making it somewhat easier to find.

Specifying the filename

If you want to go one step further and customise the filename, the dbfile parameter that we’ve been checking can also be set at initialisation. This sets the filename directly, so it can be whatever you want.

[8]:
ds = Dataset(function, dbfile="dataset_custom_filename", skip=False)

ds.dbfile
[8]:
'dataset_custom_filename.yaml'

Note

Since Databases are stored in yaml format, if you omit the .yaml extension from your dbfile, it will be added for you.

Packing to a Custom File

The functionality used for the Database file is also available to the user directly, and it does not always have to target the same file. You can pack to, and recreate from, a file of your choosing, without touching the Database.

[9]:
ds.pack(file="temporary_dataset_pack")

import os
os.path.isfile("temporary_dataset_pack")
dumping payload to temporary_dataset_pack
[9]:
True

This method of storage does not enforce the yaml file extension, though the actual file content is still of the yaml format internally.

We can recreate from this file using Dataset.from_file().

Added in version 0.13.4: After the changes to how Computer is serialised, from_file now expects a url to be passed. If one is not given, a default (localhost) url will be created.
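
If the recreated Dataset should target a specific machine, a connection can be supplied at that point. The sketch below is only an assumption of how that might look, using remotemanager’s URL connection class, a placeholder address, and a url keyword; in this tutorial we simply rely on the localhost default.

from remotemanager import URL

# assumption: from_file accepts the connection via a url keyword;
# the address below is a placeholder for your remote machine
remote = URL("user@remote.address")
restored = Dataset.from_file("temporary_dataset_pack", url=remote)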

[10]:
del ds

ds = Dataset.from_file("temporary_dataset_pack")

ds.dbfile
[10]:
'dataset_custom_filename.yaml'

Note how the dbfile has not changed, as this pack/recreate is considered a “temporary” method of transfer.

Backup and Restore

Added in version 0.9.16.

It was mentioned earlier that there is a more advanced method for backup and restore than the one we have just covered. This system automatically handles returned files in addition to the dataset itself, so it is more robust for a Dataset that also uses files.

You should keep in mind that this method handles returned files only. These are limited to:

- result
- error
- extra_files_recv

To demonstrate this, it is best to create a dataset that does return files:

[11]:
def to_file(inp, fname):
    with open(fname, "w+") as o:
        o.write(str(inp))

ds = Dataset(to_file, skip=False)

ds.append_run({"inp": "test", "fname": "test.out"}, extra_files_recv="test.out")

ds.run()
ds.wait(1, 10)
ds.fetch_results()
ds.results
appended run runner-0
Staging Dataset... Staged 1/1 Runners
Transferring for 1/1 Runners
Transferring 5 Files... Done
Remotely executing 1/1 Runners
Fetching results
Transferring 3 Files... Done
[11]:
[None]

Our function here does not return anything, so the results property holds no data. The real information is within the file that is returned.

Note

It is considered good practice to have your functions return something. This can make it much easier to diagnose problems. In this case, it would be wise to have the function return fname at the least, so we know that the function has completed as expected.
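
As a small sketch of that suggestion, the function could simply return the filename it wrote:

def to_file(inp, fname):
    # write the input to the requested file, then return the filename
    # so the results confirm that the run completed as expected
    with open(fname, "w+") as o:
        o.write(str(inp))
    return fname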

To access our “result”, we should read the content of the returned file. The extra files are instances of a special TrackedFile class, which can help with this.

Let’s get the runner that we want to inspect (index 0), then check its list of extra files to receive. Since there’s only one, we take the first index again, and print the content property of the TrackedFile that is there.

[12]:
print(ds.runners[0].extra_files_recv[0].content)
test

Limitations of the Database

Since the database only handles the properties of the Dataset and its runners directly, these extra files are only “tracked”. If they are deleted, moved, or renamed, the Dataset is essentially broken. If we delete the local file, the file contents will be gone even if we restore from a pack:

[13]:
ds.pack(file="dataset_with_files_backup.yaml")

try:
    os.remove(ds.runners[0].extra_files_recv[0].local)
except FileNotFoundError:
    print("could not remove file")

ds = Dataset.from_file(file="dataset_with_files_backup.yaml")

print(ds.runners[0].extra_files_recv[0].content)
dumping payload to dataset_with_files_backup.yaml
None

Using Backup and Restore

So in this situation, if we want to ensure the safety of our data, we should use the backup method. Let’s fetch the results again to repopulate the files and demonstrate:

[14]:
ds.fetch_results()

print(ds.runners[0].extra_files_recv[0].content)
Fetching results
Transferring 2 Files... Done
test

Now we do as before, but using backup and restore.

[15]:
ds.backup(file="full_backup.zip", full=True, force=True)

ds.hard_reset(files_only=True, confirm=False)

ds = Dataset.restore(file="full_backup.zip")

print(ds.runners[0].extra_files_recv[0].content)
test

Usage Details

There are a few things to note here, the first being that the file extension has to be .zip. If it is not, you’ll get an error.

Note

This limitation is due to the use of Python’s built-in zipfile module.

Secondly, we’re using force=True here. By default, backup will not overwrite a file that already exists, again raising an error. If you do want to overwrite, pass force=True.
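
If you are unsure whether a backup already exists, one option is to check first and only force the overwrite deliberately (a small sketch, reusing the filename from above):

import os

backup_file = "full_backup.zip"

# backup refuses to overwrite an existing archive unless force=True,
# so only pass force=True when the overwrite is intended
if os.path.isfile(backup_file):
    ds.backup(file=backup_file, full=True, force=True)
else:
    ds.backup(file=backup_file, full=True)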